This document is part of The Web at Nexor
_________________________________________________________________
LIST OF ROBOTS
This is a list of Web Wanderers. See also World Wide Web Wanderers,
Spiders and Robots.
If you know of any that aren't on this list, please let me know.
If you find you have been visited by a robot and you want to be
excluded from the searches, please mail the author directly.
_________________________________________________________________
The JumpStation Robot
Run by Jonathon Fletcher <J.Fletcher@stirling.ac.uk>.
Version I has been in development since September 1993, and has run
on several occasions; the last run was between the 8th and the 21st
of February.
More information, including access to a searchable database of
titles, can be found on The JumpStation.
Identification: It runs from pentland.stir.ac.uk, has "JumpStation" in
the User-agent field, and sets the From field.
Version II is under development.
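Identification of this and many of the robots below relies on the HTTP
User-agent and From request fields. As a rough illustration only (a
minimal sketch in Python, not the code of any robot listed here; the
robot name and addresses are made-up placeholders), a request that
sets both fields might look like this:

    import urllib.request

    # Hypothetical robot name and contact address, for illustration only.
    req = urllib.request.Request(
        "http://example.org/index.html",
        headers={
            "User-Agent": "ExampleRobot/0.1",
            "From": "operator@example.org",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()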
_________________________________________________________________
Repository Based Software Engineering Project Spider
Run by Dr. David Eichmann <eichmann@rbse.jsc.nasa.gov>. For more
information see the Repository Based Software Engineering Project.
Consists of two parts:
Spider
A program that creates an Oracle database of the Web graph,
traversing links to a specifiable depth (defaults to 5 links)
beginning at a URL passed as an argument. Only URLs having
".html" suffixes or tagged as"http:" and ending in a slash are
probed. Unsuccessful attempts and leaves are logged into a
separate table to prevent revisiting. This is effectively then,
a limited-depth breadth-first traversal of only html portions
of the Web. We err on the side of missing non-obvious html
documents in order to avoid stuff we're not interested in. A
third table provides a list of pruning points for hierarchies
to avoid because of discovered complexity, or hierarchies not
wishing to be probed.
Indexer
A script that sucks html URLs out of the database and feeds
them to a modified freeWAIS waisindex, which retrieves the
document and indexes it. Retrieval support is provided by a
front page and a cgi script driving a modified freeWAIS
waissearch.
The separation of concerns allows the spider to be a lightweight
assessor of Web state, while still providing the general community
with the added value of the URL search facility.
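As a rough illustration of the traversal described above (a
simplified sketch in Python, not the RBSE Spider itself, which keeps
its state in Oracle tables; the function names and the in-memory sets
are assumptions), a limited-depth breadth-first walk that only probes
likely html URLs and remembers failures might look like this:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    import urllib.request

    class LinkParser(HTMLParser):
        # Collects the href targets of anchor tags.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def looks_like_html(url):
        # Probe only ".html" suffixes or http: URLs ending in a slash.
        return url.startswith("http:") and (url.endswith(".html") or url.endswith("/"))

    def crawl(start_url, max_depth=5):
        seen, failed = set(), set()          # stand-ins for the database tables
        queue = deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            if url in seen or url in failed or depth > max_depth:
                continue
            try:
                with urllib.request.urlopen(url) as resp:
                    page = resp.read().decode("latin-1", "replace")
            except OSError:
                failed.add(url)              # log the failure, never revisit
                continue
            seen.add(url)
            parser = LinkParser()
            parser.feed(page)
            for link in parser.links:
                absolute = urljoin(url, link)
                if looks_like_html(absolute):
                    queue.append((absolute, depth + 1))
        return seen, failed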
Identification: it runs from rbse.jsc.nasa.gov (192.88.42.10),
requests documents with "GET /path RBSE-Spider/0.1", and uses
RBSE-Spider/0.1a in the User-Agent field.
It seems to retrieve documents more than once.
_________________________________________________________________
The WebCrawler
Run by Brian Pinkerton <bp@biotech.washington.edu>
Identification: It runs from fishtail.biotech.washington.edu, and
uses WebCrawler/0.00000001 in the HTTP User-agent field.
It does a breadth-first walk, and indexes content as well as URLs etc.
For more information see description, or search its database.
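As a toy illustration of indexing content as well as URLs (a sketch
only, not WebCrawler's actual data structures; the function names are
made up), an inverted index mapping each word to the URLs it appears
in can be kept like this:

    import re
    from collections import defaultdict

    index = defaultdict(set)                 # word -> set of URLs containing it

    def add_document(url, text):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)

    def search(word):
        return sorted(index.get(word.lower(), set()))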
_________________________________________________________________
The NorthStar Robot
Run by Fred Barrie <barrie@unr.edu> and Billy Barron.
More information including a search interface is available on the
NorthStar Database. Recent runs (26 April) will concentrate on
textual analysis of the Web versus GopherSpace (from the Veronica
data) as well as indexing.
Run from frognot.utdallas.edu, possibly other sites in utdallas.edu,
and from cnidir.org. Now uses the HTTP From field, and sets
User-agent to NorthStar.
_________________________________________________________________
W4 (the World Wide Web Wanderer)
Run by Matthew Gray <mkgray@mit.edu>
Run initially in June 1993, its aim is to measure the growth of the
Web. See the details and the list of servers.
User-agent: WWWWanderer v3.0 by Matthew Gray <mkgray@mit.edu>
_________________________________________________________________
The fish Search
Run by people using the version of Mosaic modified by Paul De Bra
<debra@win.tue.nl>
It is a spider built into Mosaic. There is some documentation online.
Identification: Modifies the HTTP User-agent field. (Awaiting details)
_________________________________________________________________
The Python Robot
Written by Guido van Rossum <Guido.van.Rossum@cwi.nl>
Written in Python. See the overview.
_________________________________________________________________
html_analyzer-0.02
Run by James E. Pitkow <pitkow@aries.colorado.edu>
Its aim is to check validity of Web servers. I'm not sure if it has
ever been run remotely.
_________________________________________________________________
MOMspider
Written by Roy Fielding <fielding@ics.uci.edu>
Its aim is to assist maintenance of distributed infostructures (HTML
webs). It has its own page.
_________________________________________________________________
HTMLgobble
Maintained by Andreas Ley <ley@rz.uni-karlsruhe.de>
A mirroring robot. It is configured to stay within a directory,
sleeps between requests, and the next version will use HEAD to check
whether the entire document needs to be retrieved.
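The HEAD-before-GET idea could be sketched roughly as follows (an
illustration under simplifying assumptions about the local file
layout, not HTMLgobble itself): ask the server for the headers only,
and re-fetch the full document just when its Last-Modified date is
newer than the mirrored copy.

    import os
    import urllib.request
    from email.utils import parsedate_to_datetime

    def needs_refresh(url, local_path):
        if not os.path.exists(local_path):
            return True                       # no local copy yet
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            last_modified = resp.headers.get("Last-Modified")
        if last_modified is None:
            return True                       # server gave no date; fetch to be safe
        remote_time = parsedate_to_datetime(last_modified).timestamp()
        return remote_time > os.path.getmtime(local_path)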
Identification: Uses User-Agent: HTMLgobble v2.2, and it sets the From
field. Usually run by the author, from tp70.rz.uni-karlsruhe.de.
The source is now available (but unmaintained).
_________________________________________________________________
WWWW - the WORLD WIDE WEB WORM
Maintained by Oliver McBryan <mcbryan@piper.cs.colorado.edu>.
Another indexing robot, for which more information is available.
Actually has quite flexible search options.
Awaiting identification information (run from piper.cs.colorado.edu?).
_________________________________________________________________
W3M2 Robot
Run by Christophe Tronche <Christophe.Tronche@lri.fr>
It has its own page. It is supposed to be compliant with the proposed
standard for robot exclusion.
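The proposed standard for robot exclusion referred to here is the
/robots.txt convention. A minimal compliance check (a sketch using
Python's standard robotparser module, not W3M2's own code; the
server, path and robot name are placeholders) might look like this:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.org/robots.txt")       # placeholder server
    rp.read()

    url = "http://example.org/private/page.html"      # placeholder document
    if rp.can_fetch("ExampleRobot", url):
        print("allowed to fetch", url)
    else:
        print("excluded by robots.txt, skipping", url)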
Identification: run from hp20.lri.fr, with User-Agent W3M2/0.02; the
From field is set.
_________________________________________________________________
Websnarf
Maintained by Charlie Stross <charless@sco.com>
A WWW mirror designed for off-line browsing of sections of the web.
Identification: run from ruddles.london.sco.com.
_________________________________________________________________
The Webfoot Robot
Run by Lee McLoughlin <L.McLoughlin@doc.ic.ac.uk>
First spotted in mid-February 1994.
Identification: It runs from phoenix.doc.ic.ac.uk
Further information unavailable.
_________________________________________________________________
Lycos
Owned by Dr. Michael L. Mauldin <fuzzy@cmu.edu> at Carnegie Mellon
University.
This is a research program in providing information retrieval and
discovery in the WWW, using a finite memory model of the web to guide
intelligent, directed searches for specific information needs.
You can search the Lycos database of WWW documents, which currently
has information about 390,000 documents in 87 megabytes of summaries
and pointers.
More information is available on its home page.
Identification: User-agent "Lycos/x.x", run from fuzine.mt.cs.cmu.edu.
Lycos also complies with the latest robot exclusion standard.
_________________________________________________________________
ASpider (Associative Spider)
Written and run by Fred Johansen <fred@nvg.unit.no>
Currently under construction, this spider is a CGI script that
searches the web for keywords given by the user through a form.
Identification: User-Agent: "ASpider/0.09", with a From field
"fredj@nova.pvv.unit.no".
_________________________________________________________________
SG-Scout
Introduced by Peter Beebee <ptbb@ai.mit.edu, beebee@parc.xerox.com>
Run since 27 June 1994, for an internal XEROX research project, with
some information being made available on SG-Scout's home page.
Does a "server-oriented" breadth-first search in a round-robin
fashion, with multiple processes.
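The "server-oriented" round-robin idea might be sketched like this
(an illustration only, not SG-Scout's implementation; the data
structures are assumptions): keep one queue of pending URLs per
server and take the next URL from each server's queue in turn, so no
single server is hit repeatedly.

    from collections import deque
    from urllib.parse import urlsplit

    queues = {}                    # host -> deque of URLs still to fetch
    rotation = deque()             # hosts, visited round-robin

    def enqueue(url):
        host = urlsplit(url).netloc
        if host not in queues:
            queues[host] = deque()
            rotation.append(host)
        queues[host].append(url)

    def next_url():
        while rotation:
            host = rotation[0]
            rotation.rotate(-1)    # move this host to the back of the rotation
            if queues[host]:
                return queues[host].popleft()
            rotation.pop()         # host exhausted; it is now at the back
            del queues[host]
        return None                # nothing left to fetch anywhere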
Identification: User-Agent: "SG-Scout", with a From field set to the
operator. Complies with standard Robot Exclusion. Run from
beta.xerox.com.
_________________________________________________________________
EIT Link Verifier Robot
Written by Jim McGuire <mcguire@eit.COM>
Announced on 12 July 1994, see their page.
Combination of an HTML form and a CGI script that verifies links from
a given starting point (with some controls to prevent it going
off-site or running without limit).
Seems to run at full speed...
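The verification step might be sketched roughly as follows (an
illustration only, not EIT's CGI script; it checks the links on a
single page, uses a crude regular expression in place of a real HTML
parser, and skips non-http links):

    import re
    from urllib.parse import urljoin
    import urllib.request

    def verify(start_url):
        with urllib.request.urlopen(start_url) as resp:
            page = resp.read().decode("latin-1", "replace")
        bad = []
        for href in re.findall(r'href\s*=\s*"([^"]+)"', page, re.IGNORECASE):
            target = urljoin(start_url, href)
            if not target.startswith(("http:", "https:")):
                continue                      # skip mailto:, ftp:, fragments, etc.
            try:
                urllib.request.urlopen(target).close()
            except OSError:
                bad.append(target)            # broken or unreachable link
        return bad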
Identification: version 0.1 sets no User-Agent or From field. From
version 0.2 up the User-Agent is set to "EIT-Link-Verifier-Robot/0.2".
Can be run by anyone from anywhere.
_________________________________________________________________
ANL/MCS/SIGGRAPH/VROOM Walker
Owned/Maintained by Bob Olson <olson@mcs.anl.gov>
This robot is gathering data to do a full-text index with glimpse and
provide a Web interface for it.
Identification: sets User-agent to "ANL/MCS/SIGGRAPH/VROOM Walker",
and From to "olson.anl.gov".
Another rapid-fire robot that doesn't use the robot exclusion
protocol. Depressing. Improvements awaited.
_________________________________________________________________
WebLinker
Written and run by James Casey <casey@ptsun00.cern.ch>
It is a tool called 'WebLinker' which traverses a section of the Web,
doing URN->URL conversion. It will be used as a post-processing tool
on documents created by automatic converters such as LaTeX2HTML or
WebMaker. More information is on its home page.
At the moment it works at full speed, but is restricted to local
sites. External GETs will be added, but these will be running slowly.
WebLinker is meant to be run locally, so if you see it elsewhere let
the author know!
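The URN->URL rewriting could be sketched very roughly like this (an
assumption-laden illustration, not WebLinker itself; the mapping
table, URN syntax and regular expression are all placeholders):

    import re

    # Hypothetical mapping from URNs to their current URLs.
    urn_to_url = {
        "urn:example:intro": "http://example.org/docs/intro.html",
    }

    def resolve_urns(html):
        def replace(match):
            urn = match.group(1)
            return 'href="%s"' % urn_to_url.get(urn, urn)   # leave unknown URNs alone
        return re.sub(r'href="(urn:[^"]+)"', replace, html)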
Identification: User-agent is set to 'WebLinker/0.0 libwww-perl/0.1'.
_________________________________________________________________
Emacs w3-search
Written by William M. Perry <wmperry@spry.com>
This is part of the w3 browser mode for Emacs, and half-implements a
client-side search for use in batch processing; there is no
interactive access to it.
For more info see the Searching section in the Emacs-w3 User's Manual.
I don't know if this is ever actually used by anyone...
_________________________________________________________________
Arachnophilia
Run by Vince Taluskie <taluskie@utpapa.ph.utexas.edu>
The purpose of this run (undertaken by HaL Software) was to collect
approximately 10k html documents for testing automatic abstract
generation. This program will honor the robot exclusion standard and
wait one minute between requests to a given server.
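The one-minute-per-server politeness rule could be sketched like this
(an illustration only, not Arachnophilia's code): remember when each
host was last requested and sleep until a full minute has passed
before requesting from it again.

    import time
    from urllib.parse import urlsplit

    DELAY = 60.0                   # seconds to wait between requests to one host
    last_request = {}              # host -> time of the previous request

    def wait_for_turn(url):
        host = urlsplit(url).netloc
        previous = last_request.get(host)
        if previous is not None:
            remaining = DELAY - (time.time() - previous)
            if remaining > 0:
                time.sleep(remaining)
        last_request[host] = time.time()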
Identification: Sets User-agent to 'Arachnophilia', runs from
halsoft.com.
_________________________________________________________________
Mac WWWWorm
Written by Sebastien Lemieux <lemieuse@ERE.UMontreal.CA>
This is a French keyword-searching robot for the Mac, written in
HyperCard. The author has decided not to release this robot
publicly.
Awaiting identification details.
_________________________________________________________________
churl
Maintained by Justin Yunke <yunke@umich.edu>
A URL-checking robot, which stays within one step of the local
server; see further information.
Awaiting identification details.
_________________________________________________________________
tarspider
Run by Olaf Schreck <chakl@fu-berlin.de> (Can be fingered at
chakl@bragg.chemie.fu-berlin.de or
olafabbe@w255zrz.zrz.tu-berlin.de)
Sets User-Agent to "tarspider <version>", and From to
"chakl@fu-berlin.de".
_________________________________________________________________
The Peregrinator
Run by Jim Richardson <jimr@maths.su.oz.au>.
This robot, in Perl V4, commenced operation in August 1994 and is
being used to generate an index called MathSearch of documents on Web
sites connected with mathematics and statistics. It ignores off-site
links, so does not stray from a list of servers specified initially.
Identification: The current version sets User-Agent to
Peregrinator-Mathematics/0.7. It also sets the From field.
The robot follows the exclusion standard, and accesses any given
server no more often than once every several minutes.
A description of the robot is available.
_________________________________________________________________
checkbot.pl
Written by Dimitri Tischenko <D.B.Tischenko@TWI.TUDelft.NL>
Another validation robot.
Sets User-agent to 'checkbot.pl/x.x libwww-perl/x.x' and sets the From
field.
_________________________________________________________________
Martijn Koster